Skip to content

perf: Element-wise comparison only for tolerance-requiring data types#26

Open
Marius Merkle (MariusMerkleQC) wants to merge 8 commits intobenchmarkfrom
optimize
Open

perf: Element-wise comparison only for tolerance-requiring data types#26
Marius Merkle (MariusMerkleQC) wants to merge 8 commits intobenchmarkfrom
optimize

Conversation

@MariusMerkleQC
Copy link
Copy Markdown
Collaborator

@MariusMerkleQC Marius Merkle (MariusMerkleQC) commented Mar 27, 2026

Motivation

See this comment.

Changes

Introduce a function _needs_element_wise_comparison that checks whether element-wise comparison needs to be performed; this is the case for

(1) float vs numeric columns -> absolute and relative tolerances apply (-> _is_float_numeric_pair())
(2) temporal columns -> absolute temporal tolerance applies (-> _is_temporal_pair())

In all other cases, naive comparison suffices, and this shortcut is taken if the above helper returns False. This avoids the expensive _compare_sequence_columns(). The performance improvement can be seen in the benchmark test.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (ef9aa25) to head (9fd6079).

Additional details and impacted files
@@             Coverage Diff             @@
##           benchmark       #26   +/-   ##
===========================================
  Coverage     100.00%   100.00%           
===========================================
  Files             10        10           
  Lines            758       780   +22     
===========================================
+ Hits             758       780   +22     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes condition_equal_columns for nested list/array columns by avoiding the expensive element-wise comparison path when tolerances/special handling aren’t needed, and updates the performance benchmark accordingly.

Changes:

  • Add _needs_element_wise_comparison() (plus helpers) to decide when list/array columns require element-wise comparison.
  • Shortcut list/array comparisons to eq_missing() when element-wise handling is deemed unnecessary.
  • Update the performance test to assert comparable performance for list<i64> comparisons.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
diffly/_conditions.py Introduces dtype-based gating to skip element-wise list/array comparisons unless tolerances/special handling are needed.
tests/test_performance.py Updates benchmark expectations to ensure the optimized path is not significantly slower than direct eq_missing().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 124 to 139
# Otherwise, we simply compare all fields independently
return pl.all_horizontal(
[
_compare_columns(
col_left=col_left.struct[field],
col_right=col_right.struct[field],
dtype_left=fields_left[field],
dtype_right=fields_right[field],
max_list_length=max_list_length,
abs_tol=abs_tol,
rel_tol=rel_tol,
abs_tol_temporal=abs_tol_temporal,
)
for field in fields_left
]
)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also shortcut the comparison here?

# Otherwise, we simply compare all fields independently
if _needs_element_wise_comparison(dtype_left, dtype_right):
  return pl.all_horizontal(
    [
      _compare_columns(
        col_left=col_left.struct[field],
        col_right=col_right.struct[field],
        dtype_left=fields_left[field],
        dtype_right=fields_right[field],
        max_list_length=max_list_length,
        abs_tol=abs_tol,
        rel_tol=rel_tol,
        abs_tol_temporal=abs_tol_temporal,
      )
      for field in fields_left
    ]
  )
return col_left.eq_missing(col_right)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I added a performance test in #25, which already passes without the above optimization, so we should definitely not add it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants